Section 1.2: Classifying and Storing Data Sources

Definition: Population and Sample

Population

  • A target group we want to study.
  • The collection of all data from that group.
  • It is often difficult (if not impossible) to obtain all data from the population.

Sample

  • A subset of the population.
  • Should represent the population as a whole.
  • It is typically easier (and sometimes only possible) to collect data from a sample.

Example 1.2.1.

Suppose you want to know what the predominant eye color in your country is. You survey a random sample of 2,500 people in your country, asking them about their eye color.

  1. Who is the population?
  2. Who is the sample?
  3. What data were collected?

Solution:

  1. The population is everyone in your country.
  2. The sample is the 2,500 people who were surveyed.
  3. The data collected were the participants’ eye colors.
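
The relationship between a population and a sample can be sketched in a few lines of Python. This is a minimal illustration, not real survey data: the population size, the list of eye colors, and their proportions are all made-up assumptions. Only the sample size of 2,500 comes from the example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: one eye color per person (synthetic values, assumed proportions).
colors = ["brown", "blue", "green", "hazel"]
population = random.choices(colors, weights=[55, 25, 10, 10], k=100_000)

# Simple random sample of 2,500 people, drawn without replacement.
sample = random.sample(population, k=2_500)

# The most common eye color in the sample is our estimate of the
# predominant eye color in the whole population.
sample_mode, sample_count = Counter(sample).most_common(1)[0]
print(sample_mode, sample_count / len(sample))
```

The sample is much smaller than the population, which is why it is easier to collect. It is only useful if it is drawn so that it represents the population.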

Classifying Data

Definition: Variables
  • A data variable is a characteristic that is measured or recorded.
  • There are two types of variables: categorical and numerical.

Definition: Categorical Variables
  • A variable is categorical if it describes a quality or a class.
  • A categorical variable may use numbers as labels, but arithmetic operations on those numbers are not meaningful.
  • Examples: eye color, zip code, letter grade, Social Security number (SSN)

Definition: Numerical Variables
  • A variable is numerical if it describes a quantity or a measurement.
  • Examples: height, weight, temperature
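
The difference between the two types shows up in which summaries make sense. The sketch below uses hypothetical values (the heights and zip codes are invented for illustration). Storing the zip codes as strings is one possible way to prevent accidental arithmetic on labels.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, for illustration only.
heights_cm = [160.0, 172.5, 181.0]        # numerical: a measurement
zip_codes = ["30301", "10001", "30301"]   # categorical: digits used as labels

# Arithmetic is meaningful for a numerical variable.
average_height = mean(heights_cm)
print(average_height)

# For a categorical variable, counting how often each label occurs is the
# meaningful summary; an "average zip code" would describe nothing.
zip_counts = Counter(zip_codes)
print(zip_counts)
```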

Example 1.2.2.

Classify each of the following variables as numerical or categorical.

Variable                                             Numerical   Categorical
Height of a building
Letter grade on a test
Hours of sleep each night
Students’ GPAs
Types of cars
Vegetable varieties planted in a garden
Number of vegetable varieties planted in a garden

Solution:

Variable                                             Numerical   Categorical
Height of a building                                     X
Letter grade on a test                                               X
Hours of sleep each night                                X
Students’ GPAs                                           X
Types of cars                                                        X
Vegetable varieties planted in a garden                              X
Number of vegetable varieties planted in a garden        X

Sorting Data

Definition: Coded Data

Coded data are data that use numbers to represent information, which can make the data easier to record and interpret.


When a variable is binary (i.e., it has only two possible values), we often code it using 0 and 1, where 0 means false and 1 means true.

Example 1.2.3.

Suppose a local animal shelter received a litter of five surrendered puppies. A volunteer named each puppy and identified its sex in the table below. The manager wants to count the number of female puppies, so she asked you to add a new column named “Female”. How would you code this new column?

Name Sex
Daisy Female
Hazel Female
Luna Female
Milo Male
Rocky Male

If a puppy is female, assign a value of 1. Otherwise, assign a value of 0.

Name Sex Female
Daisy Female 1
Hazel Female 1
Luna Female 1
Milo Male 0
Rocky Male 0
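
The coding step in this example can be sketched in Python as follows. The names and sexes are copied from the table. The variable names (`puppies`, `female`, `female_count`) are illustrative choices, not part of the original example.

```python
# Rows from the table in Example 1.2.3: (name, sex).
puppies = [
    ("Daisy", "Female"),
    ("Hazel", "Female"),
    ("Luna", "Female"),
    ("Milo", "Male"),
    ("Rocky", "Male"),
]

# Code the new "Female" column: 1 if the puppy is female, otherwise 0.
female = [1 if sex == "Female" else 0 for _, sex in puppies]

# Because the codes are 0 and 1, summing the column counts the female puppies.
female_count = sum(female)

print(female)        # [1, 1, 1, 0, 0]
print(female_count)  # 3
```

This is the practical benefit of 0/1 coding: the count the manager asked for is just the column total.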

Definition: Stacked Data

Stacked data are data values with the following characteristics:

  • Each column represents a variable.
  • Each row contains data for a single observation/individual.
  • Stacked data can store multiple variables across multiple observations.

Example 1.2.4.

The table below shows data on dogs in a local animal shelter. Each row corresponds to a single dog.

  1. Identify the variables.
  2. How many dogs are in the table?

Weight (lbs)   Gender   Illness
10             M        N
27             F        Y
6              F        N
45             M        N
65             M        N
33             F        Y

Solution:

  1. The variables are weight, gender, and illness.
  2. There are 6 rows in the table; therefore, there are 6 dogs.
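
One way to picture the stacked layout in code is a list of records, one per row. In the sketch below the values are copied from the table. The key names (`weight_lbs`, `gender`, `illness`) are assumed spellings of the column headers.

```python
# Stacked layout: one dict per dog (row), one key per variable (column).
dogs = [
    {"weight_lbs": 10, "gender": "M", "illness": "N"},
    {"weight_lbs": 27, "gender": "F", "illness": "Y"},
    {"weight_lbs": 6,  "gender": "F", "illness": "N"},
    {"weight_lbs": 45, "gender": "M", "illness": "N"},
    {"weight_lbs": 65, "gender": "M", "illness": "N"},
    {"weight_lbs": 33, "gender": "F", "illness": "Y"},
]

# The variables are the columns; the observations are the rows.
variables = list(dogs[0].keys())
n_dogs = len(dogs)

print(variables)  # ['weight_lbs', 'gender', 'illness']
print(n_dogs)     # 6
```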

Definition: Unstacked Data

Unstacked data are data with the following characteristics:

  • Data values are stored in two columns.
  • Each column represents a variable from a different group.
  • Unstacked data can only store data for two groups.
  • A row does not correspond to a single individual or observation; values that share a row are unrelated.
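
To contrast the two layouts, the sketch below holds unstacked data as one list per group and then rearranges it into stacked form. The group names and values are hypothetical and chosen only for illustration. The conversion is an added illustration, not a procedure from this section.

```python
# Hypothetical unstacked data: one column (list) per group.
# The lists may have different lengths, and values in the same position
# do not belong to the same individual.
unstacked = {
    "Group A": [5.1, 6.3, 7.0],
    "Group B": [4.8, 6.9],
}

# Stacked form: one row per observation, with one column naming the group
# and one column holding the measured value.
stacked = [
    {"group": group, "value": value}
    for group, values in unstacked.items()
    for value in values
]

for row in stacked:
    print(row)
```

In the stacked form every row is a single observation, so more variables could be added as extra columns. The unstacked form has no natural place for them.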

Example 1.2.5.

The unstacked table below shows the average number of hours slept over a one-week period for a sample of men and women.

Men Women
6.4 6.2
7.2 7.5
8.1 7.9
6.7 8.0
7.0
6.9